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A theoretical model for the folding of proteins containing disulfide bonds is introduced. The 
model exploits the knowledge of the native state to favour the progressive establishment of native 
interactions. At variance with traditional approaches based on native topology, not all native bonds 
are treated in the same way; in particular, a suitable energy term is introduced to account for the 
special strength of disulfide bonds (irrespective of whether they are native or not) as well as their 
ability to undergo intra-molecular reshuffling. The model thus possesses the minimal ingredients 
necessary to investigated the much debated issue of whether the re-folding process occurs through 
partially structured intermediates with native or non-native disulfide bonds. This strategy is applied 
to a context of particular interest, the re-folding process of Hirudin, a thrombin-specific protease 
inhibitor, for which conflicting folding pathways have been proposed. We show that the only two 
parameters in the model (temperature and disulfide strength) can be tuned to reproduce well a set 
of experimental transitions between species with different number of formed disulfide. This model 
is then used to provide a characterisation of the folding process and a detailed description of the 
species involved in the rate-limiting step of Hirudin refolding. 

INTRODUCTION 

The characterisation of the folding pathway of proteins is one of the fundamental problems in molecular biology 
and is under an increasing scientific attention due to the continuous advancements in experimental and theoretical 
biochemistry. After the work of Anfinsen 1] , who demonstrated that ribonuclease unfolds and refolds reversibly into its 
native (active) three-dimensional structure, it has generally been accepted that the primary sequence usually contains 
sufficient information to direct the complete folding process. What typically remains elusive to experimental and 
theoretical investigations is the pathway of this spontaneous process and the mechanisms that govern it. 

A considerable progress in this direction is possible for proteins containing native disulfide bonds. The formation 
of disulfide bonds during the folding process can be controlled experimentally through the use of an appropriate 
thioldisulfide redox couple and thiol quenching agent ^2, jl]. By these means the regeneration process can be 
halted, the intermediate species can be trapped, isolated and characterised. Historically, one of the most investigated 
proteins containing disulfide bonds has been the bovine pancreatic trypsin inhibitor (BPTI). Starting with the work 
of Crcighton 5j a series of crucial studies have accumulated a wealth of evidence in favour of the existence of partially 
structured intermediates along the protein folding pathway. Despite these efforts, the detailed characterisation of the 
intermediates turned out to be a delicate experimental matter and there is still not a universal agreement on whether 
intermediates contain native or non-native disulfide bonds0, 0, 0| and if there exists more than one pathway^ 
0]. In this context, the use of theoretical and computational tools 0, 0, 0, O, E| has been extremely useful in 
complementing the experimental findings with a more precise characterisation of the folding pathway, albeit obtained 
for models that greatly simplify the complexity of the real system. 

In this paper we propose a theoretical scheme to study the folding of proteins with disulfide bonds by suitably 
exploiting the (known) protein native structure. At a general level our strategy falls in the class of approaches that 
build on the importance of the native state topology in steering the folding process 13|, that is in bringing into contact 
pairs of amino-acid that are found in interaction in the native state. In the past few years an increasing number of 
experimental and theoretical studies have confirmed the utility of these approaches in the characterisation of various 
aspects of protein folding processes El E E IH H Ellfl IM f22, 23, 24, 25, 26]. In the present work we try to 
generalise this strategy by adding a suitable treatment of disulfide bonds accounting both for their strength as well 
as their capability to undergo intra-molecular reshuffling. 

In order to validate this approach we investigate the folding process of the N-terminal core domain of hirudin IIM2 
from the blood-sucking leech Hirudinaria manillensis 27] . Hirudins are the most potent and specific inhibitors of 
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thrombin (a key enzyme in blood coagulation) identified so far, and they are currently used as effective anticoagulants. 
Hirudin HM2 is composed of a compact N-terminal domain (residues 1-47) stabilised by three disulfides (Cys6-Cysl4, 
Cysl6-Cys28, Cys22-Cys37) and a highly fiexiblc, negatively charged C-terminal tail. Structural studies conducted 
on several leech-derived disulfide- rich small proteins (i.e., hirudin, decorsin, and antistasin) reveal that although these 
proteins display negligible sequence similarity and different function they share a common disulfide topology and 3D 
fold 113, suggesting that leeches use the same protein scaffold but different binding epitopes to affect hemostasis. 
Notably, it has been found that these leech antihemostatic proteins possess a T-knot scaffold closely similar to that 
observed for other unrelated proteins, including b-transforming growth factors, wheat germ agglutinins and snake 
venom toxins H^l • In this view, the results of our study may have relevant and more general implication on the 
elucidation of the folding pathway(s) of other protein systems. 

The first experimental attempts to identify the folding pathway of Hirudin date back to the studies of Otto and 
Seckler who argued that Hirudin could be a viable alternative to BPTI and showed that it was experimentally feasible 
to obtain and follow its unfolding /refolding processes ■ As for the case of BPTI, also Hirudin has been studied in 
different experiments [U iM] leading to alternative formulations of its folding pathway 0| ■ In particular 

when dissolved atmospheric oxygen was used as oxidising agent, the folding process appeared to occur first through 
the establishment of three non- native (scrambled) disulfides and later through their slow rearrangement into the native 
bonding pattern. On the contrary, no evidence for the importance of these fully-oxidised disordered intermediates 
was found in the experiments of Thannauser et al. carried out using oxidised DL-ditiothreitol (DTT°^). 

These alternative views prompted the present investigation of the Hirudin pathway within a topology-based model. 
To ascertain that, despite its simplified nature, the model was suitable to characterize the main aspects of the folding 
process we have first tuned and validated it against the set of experimental measurements provided in ref . [31 ■ Based 
on the success of this comparison we have undertaken a detailed characterisation of the folding process by monitoring 
quantities inaccessible in previous experiments. Our study confirms the experimental evidence of ref. j3j| suggesting 
that, under a certain set of experimental conditions, the rate-limiting step of the refolding process involves a transition 
from a species with two disulfides to ones with the three native disulfides 's^. A detailed analysis of the numerical 
dynamical trajectories further reveals the precise order of formation of the disulfide bonds. 

THEORY AND RESULTS 

The starting point of our analysis is the 1-47 fragment of Hirudin 'ss'l resolved by NMR "SS*! shown in Fig. ^ (see 
Methods and Materials Section). The fragment under consideration contains three disulfide bonds between residues 
6-14, 16-28, 22-37 and differs from the whole protein because it lacks the 18-residues long tail; this tail plays an 
important role in the biological activity of the protein. However, it is highly mobile due to the virtual absence on any 
non-covalent contacts with the 1-47 globular fragment. For this reason we neglected the Hirudin tail in our numerical 
characterisation of the folding process. 




FIG. 1: Native structure of the 1-47 fragment of hirudin. The six cysteine residues, 6, 14, 16, 22, 28, 37 have been highlighted. 
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The effective energy scoring function that we adopted belongs to the class of topology-based Hamiltonians 
The knowledge of the above mentioned native structure, F^, is exploited to construct an effective energy function 
that admits as the lowest energy state. The simplest form of such energy function is as follows: 

i?(F) = -^A,,.(F^).A,,-(F) , (1) 

where E{T) is the energy of a trial conformation F and A(r) is its contact matrix, whose elements are 1 [0] if amino 
acids i and j are [not] in interaction. A standard criterion based on Cq or C/j distance of pairs of amino acids is used 
to decide whether two amino acids are interacting. 

In the simple case of eqn. the energy minimum is attained when all native bonds are established. Such 
native interactions are weighted equally. This is a simplification that can be justified when the effective amino acid 
interactions in the protein are of comparable strength and has the advantage of keeping the model transparent by 
avoiding the use of imperfect energy parametrisation ,37^ i3^ ,39„ . 

It is important to notice that the equal weighting of native contacts does not imply that, in a thermalised ensemble, 
they are formed with equal probability. In fact, it is the complex interplay of energy and structural entropy that 
dictates the most probable routes to the native state from an unfolded conformation as well as the presence of 
rate- limiting steps due to the establishment of crucial sets of native contacts 0|. 

In the present study, the equal weighting of native contacts does not appear to be a good starting point since one 
has to account for the very energetic disulfide bridges that can occur between pairs of Cys. For this reason, in place 
of iQJ we adopt a scoring function consisting of two terms (see also section Methods and Materials): 

n^Vn-ss + ^iVss . (2) 

Vn-ss enforces some general constraints on the peptide chain geometry and promotes the formation of native contacts 
between pairs of amino acids other than Cys-Cys. These contacts are weighted in the same manner. The second term, 
Vss rewards the formation of disulfide bonds between pairs of cysteines. It is important to stress that the formation 
of disulfide bonds is allowed among any pair of Cys, not only the native ones. By doing so we can investigate the 
extent to which species with one or more non-native disulfides are present and if they influence the dynamics toward 
the native state, as proposed in recent experiments. 

The strength of the disulfide bonds relative to other non-covalent interactions is controlled by the parameter /i. 
This parameter should not be regarded as a relative measure of the "bare", i.e. in vacuum, disulfide strength. In 
fact, since our model does not treat the solvent explicitly, /i captures the effective strength of disulfide bonds in the 
presence of water and any other appropriate reducing/oxidising agent. For this reason the value of and also that 
of the heat bath temperature, T, have to be chosen in a suitable way so to reproduce as accurately as possible the 
conditions of a given experiment. 

In this study we shall focus on the set of experiments carried out by Thannhauser et al. where hirudin was 
refolded in the presence of various concentrations of DTT°^. We shall first show that there exists a well-defined region 
in the fJ. — T parameter space where the rates of conversion between species with different numbers of disulfides match 
well those observed in experiments. This is an important fact since it proves that, despite its simple form, the energy 
scoring function of eqn. (O can indeed be used to characterize the folding process obtaining the correct quantitative 
experimental picture. Based on such stringent validation of our strategy, we shall then monitor quantities that are 
inaccessible in current experiments and thus propose a vivid and detailed picture for the hirudin refolding process. 

We want to stress the fact that the possibility to reproducing the results of Thannhauser et al. Q for a suitable 
choice of the ^i — T parameters support the hypothesis that our model is general enough to study any folding process 
involving the formation of disulfide bonds. 

Comparison with experimental rates 

The experimental benchmark for our model is provided by a series of hirudin refolding experiments carried out 
by Thannhauser et al. |34j under various concentrations of DTT°^. By using several combined techniques it was 
established that the refolding process occurs under the reaction shown in Fig. |21where i?, 15* and 25 denote respectively 
the species with 0, 1 and 2 disulfides (either native or non-native). A certain species with three disulfides, denoted 
as 35'* was seen to convert with extreme rapidity to native hirudin and was, therefore, identified with a native 
arrangement of the three disulfide contacts. The remainder of the ensemble of structures with three disulfides (i.e. 
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with at least two non-native disulfides) is therefore denoted as 3S. The experimental characterisation of Thannhauser 
et al provides the rates for the individual reactions of the above model for a few choices of the initial concentrations 
of DTT°^ and the reduced protein species. Our first goal was to see if, for suitably chosen values of T and /x it was 
possible to reproduce such rates. 

R . IS . 2S ■ 3S 



3S 



FIG. 2: Best fit model of Thannhauser et al. [3^ to the hirudin refolding experiments. Species with 0, 1 and 2 disulfides are 
denoted as R, IS and 25, respectively. The species denoted as 3S* contains the three native disulfides. The remainder of the 
ensemble of structures with three disulfides is therefore denoted as 35*. 



Several values of T and /i were considered and, for each of them, we measured the rates of conversion between the 
R, IS, 2S, 3S and 35** species by using a Monte Carlo technique. This was done by keeping track of how many 
transitions out of any given species occurred and how many of them ended up in each of the other species. To ensure 
that the obtained dynamics could be interpreted as a (coarse grained) viable dynamical trajectory \^ only tiny local 
distortions of the peptide chain were attempted in the Monte Carlo moves. During such coarse-grained trajectories 
the formation, breaking or reshuffling of disulfide bonds obeys a set of constraints dictated by the chemistry of the 
disulfide bonds and the thiol-disulfide coupling. In particular no Cys residue is allowed to participate to more than one 
disulfide; the number of disulfide bonds can decrease or increase by at most one unit at each time step, respectively 
when an existing disulfide is broken or through the establishment of a new disulfide between two previously unbonded 
Cys residues. Intramolecular rearrangements where one bond is broken in favour of a new one (thus preserving the 
total number of disulfides) are also allowed under the requirement that the new bond involves one of the cysteines of 
the broken disulfide. 

A particular care is necessary to ensure that the Monte Carlo dynamics subject to these constraints does not 
violate detailed balance. The difficulties arise from the fact that an elementary MC distortion of distinct starting 
configurations may result in structures that are compatible with a different number of allowed disulfide bonding 
patterns. 

To overcome this problem we have proceeded as follows: to a newly generated conformation T we randomly associate 
one of the 15 possible pairing patterns of the three disulfides. If the proposed bonding pattern of the new structure 
violates the previous criteria, the structure is rejected and time is incremented. Otherwise, the energy function is 
recalculated and the structure is rejected or accepted with the usual Metropolis criterion (see Methods for further 
details). In this way, detailed balance is obviously satisfied, since the same number of bonding patterns is proposed 
for each structure. The downside is that one encounters frequent Metropolis rejections due to the "blind" proposal of 
bonding patterns. 

For each of the explored values of T and /i we have measured the correlation between the experimental rates and 
those obtained numerically. As a measure of the degree of correlation between these two quantities we used the non- 
parametric Kendall analysis • This statistical tool allows to establish if there is a relationship between two sets 
of data and how statistically significant it is. Being based on the comparative ranking of corresponding data in the 
two sets, the Kendall analysis does not rely on any pre-assigned parametric dependence (e.g. linear) between the two 
quantities. For this reason, it is regarded as a very robust measure of correlation and appears particularly appropriate 
in this context where the measured rates (both experimental and numerical) span several orders of magnitude |42| | . 

Our findings are summarised in Fig. Owhere we have shown the contour and density plot of the Kendall correlation 
coefficient, r, against the rates pertaining to the experimental conditions of Fig. 5a in ref. ■ These effective 
first-order like rates where obtained starting from the best-fit experimental rates of Table 2 in ref [3 multiplied by 
the asymptotic concentration of DTT°^ or DTT'''^'^ measured under the given experimental conditions (see Fig. 6 in 
the same reference). A darker/lighter shadow in Fig. 13 denotes the presence of a higher/lower degree of correlation. 
It is apparent that the experimental data are well reproduced in the neighbourhood of fi — 2.8, T = 1.0 for which 
the highest correlation r = 0.81 is observed. To ascertain the statistical significance of observing such correlation, 
we have computed the probability to observe, by pure chance a correlation larger than the observed one. It turns 
out that this probability (double-sided) is equal to p — 0.01, which testifies the statistical significance of the observed 
correlation. This establishes that there is a strong monotonic relation between the experimental and theoretical rates. 

A scatter plot of the logarithm of experimental rates versus the numerically obtained ones is provided in Fig. ^ 
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FIG. 3: Contour and density plot of the Kendall correlation across the T — jj. plane. The white regions correspond to the 
minimum values of the correlations were observed (r ~ 0). The highest correlation, r — 0.81, was observed in the dark area 
located around ^ = 2.8, T — 1.0. 

where the good interdependence of the data can be visually inspected. If one were able to model the folding process 
with detailed and accurate energy functions one would expect an equality of the theoretical and experimental rates, a 
part from a time conversion factor. In our case, due to the simplicity of our model we do not observe such dependence 
but encounter, instead, another simple linear relationship between the logarithms of both sets of rates. This is visible 
in the linear fit of Figure|3| the corresponding correlation coefficient is r = 0.88 . Its statistical significance (two-sided) 
over the set of 7 data points is 4% which is compatible with the significance of the more general (and robust) Kendall 
correlation. These values indicate that the correlation is highly significant from the statistical point of view. 

A further validation can be carried out by comparing the asymptotic concentration of the various species observed 
in the experiment and in an equilibrated MC trajectory. In case of a perfect correlation between the experimental 
and numerical rates, this further check would be redundant. In this context, where the correlation is not perfect (a 
significant discrepancy is seen for the transition between the SS and 25" species) this validation is useful to ascertain 
whether the differences in the rates result in significantly different equilibrium conditions. The plot of Fig. reveals 
that the asymptotic concentrations are in good accord and hence by setting T — 1 and — 2.8 one can be confident 
that the MC trajectory is compatible both the dynamical and equilibrium properties of the experimental system. 

Thermodynamics 

It is interesting to analyse the thermodynamic behaviour of the system at ii — 2.8 as a function of the heat bath 
temperature, T. By using the standard, yet powerful method of histogram re-weighting (see Methods) we have 
computed the average internal energy as a function of T. The data, shown in Fig. Efa) indicate the presence of a 
point of inficction for temperatures close to 1. The presence of a transition at this temperature is further corroborated 
by the behaviour of the specific heat which displays a clear peak at T = Tp « 0.91, the folding temperature, and 
by the presence of two minima in the free energy profile in Fig. I3b) . 

This peak is associated to the folding transition in the system and its neatness suggests that the folding process 
has a two-state character This is indeed confirmed by an analysis of the free energy landscape at Tp which 

exhibits two minima as a function of Vn-ss in correspondence of the unfolded and folded states (data not shown). 
The special point /n = 2.8, T = 1.0, corresponding to the experimental conditions of ref. ,34] appears therefore to be 
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FIG. 4: (a) Linear correlation between the log of experimental rates of Figs. 5a ref. pi and those obtained through the present 
numerical calculation for fi = 2.8, T — 1.0. The experimental rates are expressed in min~^. The points in the plot correspond 
to the following transitions A: R ^ IS, B: IS ^ 25, C: 2S ^ 35, D: 25 ^ 35*, E: 15 ^ R, F: 25 ^ 15, G: 35 ^ 25. (b) 
Scatter plot of the equilibrium fraction of the various species obtained in the Monte Carlo trajectory and in the experiment. 
The experimental concentrations were extracted from Fig. 5a of ref'.34i for t = 500 s. 
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FIG. 5: (a) Total energy and specific heat for /i — 2.8. (b) Free energy profile, as a function of the internal energy, at the point 
Ai = 2.8, r = 0.91. 



located slightly above the folding transition temperature (for fi = 2.8). This is entirely consistent with the fact that 
the experimental conditions of ref. js^l, Fig. 5a, do not particularly favour the formation of the native state which, 
indeed, involves only a few percent of the total equilibrium population. 

We have portrayed in Fig. Elthe free energy landscape for the different reduced species as a function of Vn-ss- It 
can be seen that the free energy profiles have minima at different energy values according to the number of correctly 
formed disulfides. The R, 15, 25* and 35 species, in fact, display a minimum near i? « which denotes an unfolded 
ensemble (see Fig. \^). On the contrary, the species with three native disulfides has a free-energy minimum for 
energies much closer to the native state energy. 

The relative high values of free energy associated to the 35* species reflect the particular experimental conditions 
reproduced here where the asymptotic fraction of species with native disulfides is low. Clearly, by lowering the 
temperature one favours the formation of the native structure which is accompanied by an increase of the concentration 
with native bonding patterns for the cysteines. This effect is visible in Fig. \7}^ where the average fraction of formed 
disulfide bonds is portrayed as a function of temperature. It can be seen that above Tp one has a significant formation 
of non- native bonds, that are superseded by native ones below Tp. 

An alternative picture to this one was put forward by Chatrenet and Chang based on a hirudin refolding experiment 
carried out in ref. j3j| . Due to the much more oxidising conditions than the one considered in ref. |34| , Chatrenet and 



7 




FIG. 7: Average number of total (thick continuous line), native (continuous line) and non-native (dotted line) disulfide bonds 
for (a)^i = 2.8 and {h)fj, = 10. 



Chang [321 observed that the folding of hirudin occurred through the formation of intermediates with three (typically 
non-native) formed disulfides. Through a slow reshuffling process the disulfides would then rearrange in the native 
pairing from which the native conformation could be easily reached. 

Our findings indicate that this alternative scenario could be probably captured by our model by using larger values 
of the effective disulfide strength fi. This is consistent with the different oxidising solvent conditions |3J| adopted by 
Chatrenet and Chang. 

To investigate this regime we explored the value of ^ = 10. It is important to point out that the folding transition 
temperature (identified through the location of the specific heat peak) is almost insensitive to the disulfide strength 
in the range < /x < 10. By comparing the plots in Fig. [7|it is then possible to see that for the higher value of n the 
total number of formed disulfide bonds is higher than for fi = 2.8. This result, accompanied by the fact that the total 
energy depends weakly on fj., confirms the intuition that, for large fi's the disulfides are established at early stages 
of the folding process when the rest of the protein is still unstructured. This "greedy" disulfide formation not only 
impacts on the total number of formed disulfides but also on the relative fraction of correct (native ones). As a result, 
a much higher fraction of wrong bonds is found at any temperature, as noticeable in Fig. [7}p. This picture suggests 
that the large /i regime of our model may be compatible with the scenario proposed by Chatrenet and Chang where 
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the rate-limiting step corresponded to the disentanglement of non-native disulfides in intermediates states with three 
formed disulfide bonds. However, for this alternative case the experimental rates are not available and we cannot 
corroborate in a more quantitative way the parallel between the experimental conditions of Chatrenet and Chang and 
the "large" /i regime in our model. 



FOLDING PATHWAYS 



So far we have focused on the overall thermodynamic characterisation of the refolding of hirudin, paying a particular 
attention to the validation of our model against experimental data. Having established that the detailed experimental 
reaction rates of ref. ,34] can be well reproduced we can confidently use our model to investigate finer aspects of the 
refolding process. 

We begin by examining the details of formation of the native arrangement of disulfides. Our interest is to find 
whether the formation of the 3S* species is aided by the establishment of some non-native disulfides that are later 
broken in favour of native bondings. To do so, instead of subdividing the structures according to just the overall 
number of disulfides we indexed them with a pair of numbers, {nc,nw) denoting the number of correct (native) 
disulfides, Uc and wrong (non-native) ones, riw In this way the fully reduced state, R, is indicated as (0,0) while the 
35"* state corresponds to (3,0). Altogether there are nine possible states, as indicated in FiglHl 



(1,0)' 



.(2,0) 



3S* = (3,0) 



R = (0,0) 



(1,1) 



(1,2) 
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2S 



3S 
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FIG. 8; The different states considered in our trajectory analysis. The arrows indicate the most probable route from the reduced 
ensemble toward the native arrangement of disulfides. The thickest arrow denotes the rate-limiting step of the reaction. 



We have generated several MC trajectories, each spanning about 20 million time steps (see sections Methods and 
Materials) at (T = i.O, /i = 2.8) and then recorded the probability of occurrence of all allowed transitions between 
the states of Fig. |S1 The typical dispersion on the measured rates over ten MC trajectories was typically less than 
1% but augmented to about 10 % for the few very improbable transitions. Then we have considered all possible 
productive routes taking from R to 3S* in a preassigned number of transitions. The probability of occurrence for 
any such routes is obtained by multiplying the individual probabilities of any of the elementary steps. By considering 
productive routes involving an increasing number of transitions it can be established that the most probable route 
(see Fig. |H1) is i? = (0,0) (1,0) (2,0) (3,0) = 35"* which is more than ten times as likely than the second 
ranking paths which is R ^ (0, 1) — > (1,0) (2,0) 3S* . These findings seem robust against variations of the 
parameters in our model. For example, even changing the temperature to 0.8 (keping fj, = 2.8) so that the formation 
of native species is strongly favoured (see Fig. [7^) it is found that the top ranking routes taking from the reduced 
state to the native one are the same as above. Interestingly, the relative weight ratio of the top productive routes is 
analogous to the one encountered for T — 1.0; on the other hand the fact that the native formation is highly favoured 
at T = 0.8 is reflected in an increase, by several orders of magnitude, of the weight of the productive routes. 

We complete the present section by identifying the rate limiting steps of the folding process. In a sequential reaction 
the rate limiting step is straightforwardly identified as the slowest one. In the present scheme such simple analysis 
cannot be carried out due to the presence of several alternative routes that can take to the native state. An objective 
and convenient way to identify the rate limiting step in such a situation is to identify the reaction whose rate change 
affects the most the production of the species of interest, in this case 35**. Therefore, by using the measured rates 
for the transitions between the states of Fig. |S| we have integrated the associated master equations starting from 
a fully reduced population. We have then changed by 10 % each of the rates and identified the time at which the 
concentration of 35'* crosses a pre-assigned threshold value. We found that, almost independently of the threshold 
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value, the most sensitive step among all the allowed ones was (2, 0) — > 35** which is the last step of the most probable 
route leading to 35'*. In principle this may have not been the case, especially in the presence of equally important 
paths leading to 35** . The fact that the most probable routes include the rate limiting steps confirms the existence of 
a well defined succession of events taking to the native state. 

As mentioned before, based on experimental dynamical plots, Thannhauser et al. [s^ had determined that the 
rate limiting step was the 25 — * 35* one. In their study it was not possible to characterise by direct means whether 
the transition to the 35* state occurred from a 25 state comprising only native cysteine bonds, although this was 
reputed to be the most likely scenario. Our picture thus fully supports the experimental results concerning the 
25 — > 35* rate-limiting step, but also adds novel insight in the process by indicating explicitly that the most crucial 
step involves a particular 25 species, namely one with two native disulfides. In the remainder of this section we shall 
further characterise the folding process by identifying which of the native disulfides are formed in the most probable 
routes (or rate- limiting steps). This is done by following a series of individual dynamical trajectories where the native 
conformation is reached starting from a reduced state. 

Folding trajectories 

The Monte-Carlo procedure allows for a detailed study of the folding pathways and for a systematic analysis of 
the specific order of formation of disulfide bonds. By setting /i ~ 2.8 the protein is initially thcrmalised at an high 
temperature (T ~ 10) where, in our model, the disulfide bonds are completely reduced and then suddenly quenched 
to T = 1.0. For each kinetic trajectory the order of formation of the disulfide bonds was stored and a statistical 
analysis accomplished by comparing several runs. In particular we have recorded the exact type of disulfide bridges 
forming the different species (15,25) and from their relative concentration we have inferred the pathway. 

The six Cys residues of the fully reduced protein, 6, 14, 16, 22, 28, 37 are equally likely to participate in forming 
the initial disulfide bond. Although in the early folding stages the contact (14 — 16) appears quite frequently due to 
the short sequence distance between the amino-acids, after the molecule has equilibrated through a series of rapid 
internal disulfide interchange reactions, only 5 of the 15 possible 15 states exist in significant quantities. They are: 
(6-14), (6-16), (16-22), (28-37) that are respectively present in the following equilibrium concentrations: 7% 6% 12% 
and 26%. It is interesting to notice that the most common bond among the 15 ensemble is a non-native one, (28-37), 
while the only native bond that is present in significant quantities is (6-14). 

Of the 45 possible 25 states, seven occur in significant quantities. Two of them, (6-14;16-28) present in the 1% of 
the cases and (6-14,22-37) present in the 0.7% of the cases, are formed by native bonds. Other two, (6-14;16-22) and 
(6-14,28-37) present with concentration of 2% and 5% respectively, have a native and a non-native contact. Whereas 
the last three, (6-16,28-37) with a frequency of 9%, (14-22,28-37) with a frequency of 7% and (16-22,28-37) with a 
frequency of 7%, are formed by non-native disulfide bonds. The relatively high concentration of species with non- 
native states, which are not present in the most probable productive routes expected for the folding (see the previous 
section), reflect the particular experimental conditions reproduced here (induced by the choice of T and fi) in which 
the formation of the native state is not particularly favoured (see Fig. 0J)). 

The analysis of the folding pathway reported in the previous section and the statistical analysis of the dynamical 
productive pathway can be summarised in the following re-folding picture: the reduced proteins forms first the native 
contact (6 — 14) and consequently either the states (6-14;16-28) or (6-14,28-37) almost with the same probability. 
The folding proceeds with the formation of the last disulfide bond. This conversion is relatively slow in agreement 
with the finding of the previous section on the rate limiting step of the reaction. The presence of conformation with 
three scrambled disulfide bonds turns out not to be statistically significant also from this kinetic analysis. A graphical 
representation of the most probable trajectory is shown in Fig. El 

The fact that the majority of structures sampled in the quenching process involve non-native bonding patterns is not 
in contradiction with the findings of the previous section, where the most probable productive route was shown to be 
free of non-native bonds. In fact, the high concentration of structures with non-native bonds docs not automatically 
imply that they contribute significantly to the refolding "fiux" towards the native state. On the contrary, species 
involving native bonds even if they do not accumulate to high equilibrium concentration, appear to take part to the 
most efficient routes leading to the native state. 
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IS 



2S 



3S 



FIG. 9: Snapshots of one of the possible folding trajectories. The IS state involves the native contact (6-14), while the 2S 
species involves the native bonds (6-14) and (16-28). A black sphere is used to distinguish the N-terminus while cysteines 
involved in disulfide bonds are highlighted as dark segments. 



METHODS AND MATERIALS 



Structure of Hirudin 



The structure of the synthetic analogues of fragment 1-47 of hirudin HM2 in the free state was modelled on the 
NMR solution structure of natural fragment 1-47 35] , almost super-imposable to that of the corresponding segment 
(1-49) in intact hirudin HVl variant|43j (PDB code: SHIR), showing 75% sequence identity with hirudin HM2, or to 
that of HVl fragment 1-51 0| (PDB code: IHIC). The best-representative model in the NMR ensemble was selected 
using the program OLDERADO available on-line at the site http://neon.chem.le.ac.uk 



The Model 



In our model a generic conformation r{fi} of the protein is modelled as a self-avoiding chain of connected Ca 
atoms located at position , where i is the amino-acid chain index (ranging from 1 to 47). Starting from the Ca 
coordinates the peptide dihedral angles are calculated and hence, following standard geometrical rules we 
construct an effective C/3 centroid for all residues with the exception of the two end residues and the seven Glycines 
(indices: 10,18,23,25,34,40,42). 

The energy scoring function consists of two terms. The first one incorporates a standard bias toward the formation 
of native non-disulfide bonds but also disfavours the formation of non-native bonds (except for disulfide ones) and 
penalises significant deviations from the native dihedral angles. These details are known to improve the cooperatively 
of the modelled folding process [i^ lislli^ l50| . The explicit expression of this term, evaluate on a trial structure F 
is given by: 



Vn-ss{T) - 



6^ 



10 



(3) 



+ V2 



kN\2 



^1=3,46 V 
"^constraints 



(4) 



where denotes the distance of the ith and jth Ca atoms in the trial structure F; a superscript N is used to denote 
analogous quantities pertaining to the native structure. The contact map A is computed by considering as threshold 
a distance of 8 A between the Ca atoms. The prime in the summation indicates that no contribution is considered 
if both the amino-acids are Cys. Finally, the angular term has been constructed using the standard dihedral angles. 
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9 and The coefficients Vo,Vi and V2 are used to control the strength of interactions and are set equal 

to 1, 5 and 1 energy units, respectively. The last term in the expression, Vconstraints, is used to enforce a series 
of knowledge-based constraints whose violation is penalised through an "infinite" energy penalty (that is through a 
rejection of the violating conformation). The constraints are as follows: (1) the distance between two consecutive Cq 
atoms must remain in the interval 3.7-3.9 A, (2) the distance between two non consecutive Ca atoms must be greater 
than 4 A, (3) the distance between any two Cp must be greater than 2 A and (4) the distance between any two Ca 
and Cp centroids must be greater than 2 A. 

The second term of the Hamiltonian, Vss rewards the formation of disulfides, irrespective of whether they are native 
or not. As explained in the next subsection, each proposed configuration, described in terms of C^'s and C/j's, comes 
with a set of three putative bonds among the six cysteines. Of these putative bonds only those among residues whose 
C^'s are a separation smaller than 5 A are considered to be effectively present and hence give a contribution equal to 
—fi to the total energy. 

Monte-Carlo Method 

As mentioned in the text, we used Monte Carlo dynamics for studying the folding process. At each Monte-Carlo step 
the current structure is distorted through local deformations |5 j | based on two equally-probable moves: (a) single-bead 
move: a random a^p/ia-carbon is chosen and is displaced randomly by at most 1 A along each Cartesian direction, 
(b) crankshaft move: two sites, i and j, with sequence separation at most 6 are chosen and all the sites between them 
are rotated around the axis joining i and j by an angle chosen randomly in the interval — ^ < < jq- As mentioned 
before, besides this structural rearrangement, at each Monte Carlo step we also associate a randomly-chosen pairing 
pattern for the six cysteines so that each of them is involved in a putative disulfide bond. These bonds are termed 
putative because, to satisfy detailed balance, the pairing assignment is done blindly that is without inspecting whether 
a given pair of cysteines is at a distance compatible with the existence of a disulfide bond. One then checks whether 
(irrespective of the pairing distances) the proposed bonding pattern is compatible with the undistorted configuration, 
i.e. if it respects the rules of section 2.1 about disulfide formation and thiol/disulfide coupling. If not the configuration 
is rejected, the Monte Carlo clock is advanced and a new distortion and bonding pattern is considered. Otherwise 
one proceeds as in ordinary Metropolis schemes after having calculated the energy of the proposed configuration. It 
is important to notice that this latter step involves the inspection of the distance of the putatively bonded cysteines 
to reward only those bonds that are geometrically feasible. 

The efficiency of the Monte-Carlo algorithm to study the thermodynamics was enhanced by the multiple Markov- 
chain sampling schemels^, a method that has proved quite effective in exploring the low temperature phase diagrams 
of proteins and interacting polymers. All the runs have been performed by covering the temperature range of interest, 
T = [0.2, 1.5], with 20 Markov chains uniformly-spaced in temperature. 

The data obtained in the multiple-Markov-chain runs at different values of T and fj, were further processed through 
a generalised multiple histogram technique inspired by the work of ref. [s^. This strategy allowed to reconstruct 
faithfully the density of states (number of configurations) in the multi-dimensional space of reaction coordinates 
constituted by Vn-ss and the number of correct (native) and wrong (non-native) disulfides: {nc,nw)- The data from 
the different runs were combined so to minimize the error propagation on the density of states. The typical uncertainty 
of the reconstructed free-energy at the values of T and /i considered here is about 0.5 energy unit (this estimate follows 
from the analysis of free energy dispersion when half of the collected data is used) . By these means it was possible to 
obtain the total energy and specific heat curves of Fig. [Sjand the free energy profiles of Fig. 

CONCLUSIONS 

We have proposed a theoretical framework to model and study the folding of proteins containing disulfide bonds. 
The approach is based on the knowledge of the native state of a protein but contains an appropriate term to account 
for the possibility that native or non-native disulfide bonds can form. The main advantage of the model proposed 
here is its simplicity which allows for a detailed description of all the folding pathways through the monitoring of the 
correct /incorrect contact formation. 

The model has been validated by investigating the debated re-folding pathways of Hirudin which has been object of 
several experimental studies. It was shown that there exists a region in our two-dimensional parameter space where 
the rates of conversions between different oxidised species are in good agreement with experimental measurements 
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|34| . Starting from this successful comparison we have then attempted a detailed characterisation of the whole folding 
process. 

At a coarse-grained level our results is consistent with the scenario described by Thannhauser et al. js^ suggesting 
that the rate limiting step turns out to be 25* iS* . Our approach, that allow for a precise identification of the 
formed contacts, shows clearly that the 25 state appears to involve native intermediates, possibility that was reputed 
as the most probable situation in ref. [s^l although experimentally it was impossible to show it directly. The analysis of 
several folding trajectories also allowed the identification of the most probable folding route and the typical associated 
succession of formation of disulfide bridges. Furthermore, the thermodynamics of the system was elucidated by using 
statistical mechanical techniques to reconstruct the free energy profiles for the whole system and also for the different 
oxidised species. 

Due to its simplicity, the proposed model, cannot capture those aspects of the folding process that result from 
the delicate interplay of amino acid specific interactions. The successful comparison of the theoretical predictions for 
hirudin with the experimental findings suggests that also, for disulfide containing proteins, suitable topology-based 
models can be profitably used to elucidate the folding pathways, even in the presence of non-native intermediates. 
Thus, the present approach, which is general and not specifically tailored for hirudin, ought to be straightforwardly 
applicable in other contexts providing a useful complement of experimental techniques in the characterisation of the 
folding process in the presence of disulfide bonds. 
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